1 Introduction

Amplicon sequencing is called a lot of different things: metabarcoding, tag-sequencing, rRNA sequencing, 16S, 18S, ribosomal RNA, metagenomics (this one is not technically correct), marker-gene sequencing, etc.

Essentially, amplicon sequencing refers to PCR amplifying part of a particular gene (based on the taxonomic group of interest), then sequencing those amplicons for many samples of mixed origin (e.g. microbiome, soil, ocean). The goal is usually to asses the microbial diversity and community composition in different samples and ultiately determine if different microbes are more or less important in different environments.

Over the years, there have been several tools developed to analyze this type of data: Mothur, Qiime, Uparse, Usearch, DADA2, MED, Qiime2.

Current best practices are denoising algorithms (rather than clustering as was used previously).

DADA2, Unoise, and deblur are denoising algorithms

Qiime2 wraps DADA2 (or optionally deblur) to denoise amplicon data.

DADA stands for Divisive Amplicon Denoising Algorithm.

Denoising refers to modeling sequencing errors in a data set in order to identify truly unique sequences (Amplicon Sequence Variants, ASVs).

You can read more about DADA2 here: https://www.nature.com/articles/nmeth.3869

2 Install Qiime2

https://qiime2.org/

You can install Qiime2 with conda!

wget https://data.qiime2.org/distro/core/qiime2-2019.4-py36-linux-conda.yml

conda env create -n qiime2-2019.4 --file qiime2-2019.4-py36-linux-conda.yml

Activate your qiime2 environment:

source activate qiime2-2019.4

3 Data and metadata

In this day and age, you will typically receive paired-end demultiplexed data from your sequencing center.

Therefore, you will have a directory that will look something like this:

RSJ2_A10_S73_L001_R1_001.fastq.gz   RSJ2_B2_S10_L001_R1_001.fastq.gz    
RSJ2_A10_S73_L001_R2_001.fastq.gz   RSJ2_B2_S10_L001_R2_001.fastq.gz    
RSJ2_A11_S81_L001_R1_001.fastq.gz   RSJ2_B3_S18_L001_R1_001.fastq.gz    
RSJ2_A11_S81_L001_R2_001.fastq.gz   RSJ2_B3_S18_L001_R2_001.fastq.gz    
RSJ2_A12_S89_L001_R1_001.fastq.gz   RSJ2_B4_S26_L001_R1_001.fastq.gz    

All of your sequences files should be names following this convention: uniqueID_S##_L001_R1_001.fastq.gz and you should have an R1 and R2 file for each unique ID.

All the files you want to include in your analysis should be in one directory and NOTHING else should be in that directory.

You will also need a metadata file that looks somethings like this:

Sample ID       Treatment       Gene
RSJ1_A1         t3 H1           16S
RSJ1_B1         t6 C2           16S
RSJ1_C1         t6 H4           16S
RSJ1_D1         t4 C3           16S
RSJ1_E1         t3 H1           18S
RSJ1_F1         t6 C2           18S
RSJ1_G1         t6 H4           18S
RSJ1_H1         t4 C3           18S
RSJ1_A2         t2 H1           16S

It should be a tab-separated .txt file. The first column has to be Sample ID and include the entire “uniqueID” before the _S##_L001_R1_001.fastq.gz in the file name.

4 Denoising your data and getting a feature table

4.1 Quality Checking

For denoising, you would like to remove low quality bases at the beginning and end of your reads, but all of your sequences have to be the same length. Therefore, we can look at summaries of sequence quality and choose cut offs for the beginning and end of the sequences overall.

Import your data to qiime:

qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path ./RSJ1 \
--input-format CasavaOneEightSingleLanePerSampleDirFmt \
--output-path RSJ1_16S_demux-paired-end.qza

The --input-path points to the directory with all of the sequences you are included in your analysis.

Next, convert the .qza file into a .qzv file.

qiime demux summarize \
--i-data RSJ1_16S_demux-paired-end.qza \
--o-visualization RSJ1_16S_demux.qzv

All .qzv files can be dragged into the qiime2 website (https://view.qiime2.org/) to be rendered and viewed interactively.

In this example, we can see that quality starts to drop off in Forward reads around 280 bp and at around 260 bp in the reverse reads. Therefore, we will trucate forward reads at 280bp and truncate reverse reads at 260bp. Just to be sure that we are getting rid of primer seqs, we will also trim the first 10bp of forward and reverse reads.

4.2 Denoising with DADA2

qiime dada2 denoise-paired \
    --i-demultiplexed-seqs RSJ1_16S_demux-paired-end.qza \
    --output-dir ./dd2RSJ1 \
    --o-representative-sequences RSJ1_16S_rep-seqs \
    --p-trim-left-f 10 \
    --p-trim-left-r 10 \
    --p-trunc-len-f 280 \
    --p-trunc-len-r 260 \
    --p-n-threads 3 

I ran this on my laptop, which can have up to 4 threads running, so I ran with --p-n-threads equal to 3. If you are running on an HPC, you can have more threads running! Running this on your local machine takes tiiiiime (up to 24 hrs for a single MiSeq run worth of data).

This command will make files, table.qza and denoising_stats.qza within the --output-dir. In order to summarize the results in these tables, use the following commands:

qiime feature-table summarize \
    --i-table ./dd2RSJ1/table.qza \
    --o-visualization ./dd2RSJ1/table.qzv \
    --m-sample-metadata-file ./RSJ1samplemap.txt   

    qiime metadata tabulate \
    --m-input-file dd2RSJ1/denoising_stats.qza \
    --o-visualization dd2RSJ1/denoising_stats.qzv

Again, both of these files can be pulled over to https://view.qiime2.org/.

Denoising Stats

The denoising stats shows you how many raw reads existed in each sample and how many were filtered out at each step.

Feature table summary

The table summary will show you how many reads per sample are in the final data set and will tell you how many “features” or ASVs are in your data set.

5 Taxonomy

First you have to train the feature classifier on the database that you choose (or download a pre-trained classifier):

Choose your reference database or pre-trained classifier from here: https://docs.qiime2.org/2019.4/data-resources/

An additional database for microbial euks = https://github.com/pr2database/pr2database

If you download the entire SILVA database (132) for example:

Set variables for the path to the representative sequences and the taxonomy assignments for those sequences:

Rep_set SILVA97otus=/Users/brisbin/Desktop/SILVA_132_QIIME_release/rep_set/rep_set_16S_only/97/silva_132_97_16S.fna

Taxonomy Tax97=/Users/brisbin/Desktop/SILVA_132_QIIME_release/taxonomy/16S_only/97/consensus_taxonomy_all_levels.txt

Then convert the representative sequences and the taxonomy to qiime files:

    qiime tools import \
    --type 'FeatureData[Sequence]' \
    --input-path $SILVA97otus \
    --output-path 97_otus16.qza

    qiime tools import \
    --type 'FeatureData[Taxonomy]' \
    --source-format HeaderlessTSVTaxonomyFormat \
    --input-path $Tax97 \
    --output-path 97ref-taxonomy16.qza

Next, train the classifier: (you can optionally select only the gene region you used in your study as the classifier training sequence by supplying your primer sequences, this sometimes leads to more accurate classification)

    qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads 97_otus16.qza\
    --i-reference-taxonomy 97ref-taxonomy16.qza\
    --o-classifier 97classifier16.qza

Finally, assign taxonomy to your ASVs:

    qiime feature-classifier classify-sklearn \
    --i-classifier 97classifier16.qza \
    --i-reads 16rep-seqs.qza \
    --o-classification 16taxonomy.qza

and convert the taxonomy tor a view file:

    qiime metadata tabulate \
    --m-input-file 16taxonomy.qza \
    --o-visualization 16taxonomy.qzv

pull over to https://view.qiime2.org/

There is a nice package that pulls .qza files directly into R, but for taxonomy it only works with certain databases (not SILVA). As a result, its pretty convenient to just click the “Download metadata TSV file” button and then import the .tsv file to R.

You can examine alpha and beta diversity with Qiime2, but you get a lot more control and can make much better looking figures if you do these analyses with phyloseq.

6 Phyloseq

Phyloseq is a really functional package that helps out with almost all of the analyses you might be interested in performing on microbial amplicon data.

Phyloseq publication: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0061217

Phyloseq documentation: https://joey711.github.io/phyloseq/import-data.html

Install Phyloseq:

source('http://bioconductor.org/biocLite.R')
biocLite('phyloseq')

6.1 data import

Qiime2R is superhelpful for importing qiime2 artifacts for Phyloseq!

Install Qiime2R

install.packages("devtools") #if you don't have devtools installed
devtools::install_github("jbisanz/qiime2R")

Load the qiime2R and phyloseq packages:

library(qiime2R)
library(phyloseq)
## Warning: package 'phyloseq' was built under R version 3.5.2

we also need ggplot & tidyr:

library(ggplot2)
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.5.2

load feature table: qiime2R works beautifully with the feature table …

phyloseq<-qza_to_phyloseq(features="JuneOctMergedTable.qza")
## Warning in read_qza(features): Artifact was not generated with Qiime2
## 2018.4, if data is not successfully imported, please report here
## github.com/jbisanz/qiime2R/issues

This alone makes it great. The feature table is the most complicated data artifact to deal with; withou qqime2R, we would need to convert the the .qza to a .biom. This is essentially what qiime2R is doing under the hood for us.

Load metadata:

metatable <- read.csv("JuneOctSampleMap.csv", header = TRUE)
row.names(metatable) <- metatable[["SampleID"]]
library("dplyr")
## Warning: package 'dplyr' was built under R version 3.5.2
metatable <- metatable %>% select(SampleID, everything())
META<- sample_data(metatable)

load taxonomy:
Unfortunately qiime2R doesn’t play well with SILVA taxonomy, so we need to do this manually.

taxonomy <- read.csv("JuneOctTaxonomy.csv", stringsAsFactors = FALSE)
names(taxonomy) <- c("row", "tax", "Confidence") #change the headers (column names)
row.names(taxonomy) <-taxonomy[[1]] #move the feature ID column to become the row names
taxonomy <- taxonomy[,(-1)] #delete the feature ID  column 

The tax column currently looks like this: D_0__Bacteria;D_1__Cyanobacteria;D_2__Melainabacteria;D_3__Obscuribacterales;D_4__uncultured bacterium;D_5__;D_6__;D_7__;D_8__;D_9__;D_10__;D_11__;D_12__;D_13__;D_14__ for each feature. We want to separate the taxonomy levels into different columns and we dont really need to hold on to D_9__ + because these levels are not used for any of our feautures

taxonomy <-  separate(taxonomy, tax, c("D0","D1", "D2", "D3", "D4", "D5", "D6", "D7", "D8", "D9", "D10", "D11", "D12", "D13", "D14"), sep = ";", fill = "right")
taxonomy <- taxonomy[,c(1:8)]

Convert the taxonomy dataframe to a matrix, and then to a phyloseq object:

taxmat <- as.matrix(taxonomy)
TAX = tax_table(taxmat)

add taxonomy and metadata to phyloseq object:

ps = merge_phyloseq(phyloseq, TAX, META)

Take a look at your phyloseq object:

ps
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 36007 taxa and 182 samples ]
## sample_data() Sample Data:       [ 182 samples by 9 sample variables ]
## tax_table()   Taxonomy Table:    [ 36007 taxa by 8 taxonomic ranks ]

This is actually a pretty big data set with 182 samples.

We can use phyloseq’s great subsetting options to grab a portion of this dataset:

ps<- subset_samples(ps, Type == "Field" & Month == "June" & Treat != "RS")
ps
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 36007 taxa and 12 samples ]
## sample_data() Sample Data:       [ 12 samples by 9 sample variables ]
## tax_table()   Taxonomy Table:    [ 36007 taxa by 8 taxonomic ranks ]

6.1.1 Prevalence Filtering

Prevalence combines ASV abundance and the number of samples an ASV is found in. If an ASV has really low prevalence, it might be a mistake. It is common to apply 5% prevalence filtering, but it really depends on your data whether this is the best thing to do or not. If you are interested in rare taxa, or if your samples come from really different sources, you probably don’t want to apply prevalence filtering. Regardless, its a good initial check that your count data makes sense.

prevdf = apply(X = otu_table(ps),
               MARGIN = ifelse(taxa_are_rows(ps), yes = 1, no = 2),
               FUN = function(x){sum(x > 0)})

prevdf = data.frame(Prevalence = prevdf,
                    TotalAbundance = taxa_sums(ps),
                    tax_table(ps))

prevalence plot:

prevplot1<-ggplot(prevdf, aes(TotalAbundance, Prevalence / nsamples(ps),color=D1)) +
  geom_hline(yintercept = 0.05, alpha = 0.5, linetype = 2) +  geom_point(size = 2, alpha = 0.7) + 
  theme_bw()+
  scale_x_log10() +  xlab("Total Abundance") + ylab("Prevalence [Frac. Samples]") +
  facet_wrap(~D1) + theme(legend.position="none")

prevplot1

There are definitely some low prevalence taxa.

Lets just keep the more prevalent groups for relative abundance plots (to decrease the number of colors needed to make the plot, that wont be visible anyways)

highPrev<-  c("D_1__Acidobacter", "D_1__Actinobacter", "D_1__Bacteroidetes", "D_1__Chloroflexi", "D_1__Cyanobacteria", "D_1__Dadabacteria", "D_1__Epsilonbacteraeota", "D_1__Euryarchaeota", "D_1__Firmicutes", "D_1__Fusobacteria", "D_1__Marinimicrobia (SAR406 clade)", "D_1__Planctomycetes", "D_1__Proteobacteria", "D_1__Rokubacteria", "D_1__Verrucomicrobia", "D_1__Gemmatimonadetes")
psNHighPrev<- subset_taxa(ps, D1 %in% highPrev)

Coverting counts to relative abundance helps normalize the count data for library size. This is not the only (or best) way to normalize for library size but it is used with the Bray-Curtis dissimilarity and is necessary for relative abundance bar plots.

physeqPra<- transform_sample_counts(psNHighPrev, function(OTU) 100* OTU/sum(OTU))

Then we can glom the taxa at the D1 level to make the plot cleaner: (it also makes the plot build a lot faster)

glomD1<- tax_glom(physeqPra, "D1")

You can change the order that samples appear on the X axis by adding factor levels to metadata:

metaField <- metatable[metatable$Treat =="A" & metatable$Month == "June" & metatable$Treat != "RS" , ]
metaField$Treatment <- factor(metaField$Treatment, levels=c("A1 6/13", "A2 6/13", "A3 6/13", "A4 6/13", "A1 6/16", "A2 6/16", "A3 6/16", "A4 6/16", "A1 6/19", "A2 6/19", "A3 6/19", "A4 6/19"))
METArs<- sample_data(metaField) # convert adjusted metadata table to phyloseq object
sample_data(glomD1) <- METArs # add new metadata phyloseq object

Some people care a lot about plot colors …

Jcolors is a nice package for plot colors: https://cran.r-project.org/web/packages/jcolors/vignettes/using_the_jcolors_package.html

install Jcolors:

devtools::install_github("jaredhuling/jcolors")

Other color packages I really like:
Wes Anderson movie palettes: https://github.com/karthik/wesanderson
La Croix palettes: https://github.com/johannesbjork/LaCroixColoR
Beyonce palettes: https://github.com/dill/beyonce

Load the Jcolors package:

library(jcolors)

in relative abundance plots you almost always need more colors than any single palette provides. You can cat together several palettes:

j5<- jcolors("pal5")
j6<-jcolors("pal6")
j7<-jcolors("pal7")
j9<-jcolors("pal9")
colors<- c(j7,j9,j6, j5)
colors <- rep(colors, 5)
colors <-c(unname(colors))
colors[9] = "#899DA4"
taxabarplotD1<-plot_bar(glomD1, x= "Treatment", fill = "D1") +  scale_y_continuous(expand = c(0, 0)) + ggtitle("") + scale_fill_manual(values=colors ) + theme(legend.title=element_blank()) + geom_bar(aes(fill=D1), stat="identity", position="stack", width =0.9) +theme_classic() + theme(text = element_text(size=14))+theme(axis.text.x = element_text(angle = 90)) + xlab("Sample") + ylab("Relative Abundance(%)") + theme(text = element_text(size=14))
taxabarplotD1+ theme(legend.position="none")

ggsave is a fabulous function for saving your plots. It is so smart, it can read file extensions and will create whatever type of file you specify with the file extension. It creates a pdf/jpg/png of whatever plot you made last and saves it in your working directory.

ggsave("Relative_Abundance_Plot.pdf", width = 8, height = 5)

Alternatively, you can make interactive taxabarplots in qiime2 for exploratory analysis:

qiime taxa barplot \
--i-table table.qza \
--i-taxonomy taxonomy.qza \
--m-metadata-file samplemap.txt \
--o-visualization taxa-bar-plots.qzv   

7 Alpha Diversity

Phyloseq can compute basic Alpha Diversity metrics:

Alpha diversity should be computed on the full, unfiltered data set.

so we will reorder the raw dataset:

sample_data(ps) <- METArs # add new metadata phyloseq object
plot_richness(ps, measures=c("Observed", "Shannon"), x = "Treatment") + theme_bw() + theme(text = element_text(size=14))+ theme(axis.text.x = element_text(angle = 90))

Breakaway is package that integrates with phyloseq and models species richness https://github.com/adw96/breakaway

and.. allows for significance testing on differences in richness between sample types

install breakaway:

devtools::install_github("adw96/breakaway")

Load Breakaway library:

library(breakaway)
library(tibble)

Run Breakaway and plot the results:

jba <- breakaway(ps)

jbadf<- summary(jba) %>%
  add_column("SampleID" = ps %>% otu_table %>% sample_names)

jbadf<- merge(jbadf, metatable, by = "SampleID")

jbaPlot <- ggplot(jbadf, aes(x=Time, y=estimate, fill= Month)) + geom_boxplot() + theme_bw() + theme(text = element_text(size=14)) +ylab("Richness Estimate") +xlab("") +scale_fill_manual(values=c("#42858C"))

jbaPlot

#ggsave("breakaway_june.png", width = 5, height = 4)

Use the betta function to build a regression model on richness estimates:

obt <- betta(summary(jba)$estimate,
            summary(jba)$error,
            make_design_matrix(jbadf, "Time"))
obt$table
##                  Estimates Standard Errors p-values
## (Intercept)       525.1787        269.1881    0.051
## predictors16-Jun 1985.5775        466.2476    0.000
## predictors19-Jun  290.0347        466.2492    0.534

June 16 is significantly more diverse than June 13 and June 19!

8 Beta Diversity

8.1 Bray-Curtis

For Bray-Curtis, use the relative abundance transformed phyloseq:

ordu = ordinate(physeqPra, "PCoA", "bray")
p<-plot_ordination(physeqPra, ordu, color="Time")+theme_bw() +scale_color_manual(values=colors)+ geom_point(size=3)+
  theme(text=element_text(size=14))
p

Significance testing:

we need another package - vegan http://cc.oulu.fi/~jarioksa/softhelp/vegan.html

install.packages("vegan",repos="http://r-forge.r-project.org")
library(vegan)

set seed to make results reproducible:

set.seed(1)
OTUs <- t(data.frame(otu_table(physeqPra))) #get data frame of symbiont SVs from phyloseq object object
row.names(OTUs) <- gsub("\\.", "-", row.names(OTUs))
meta <- metatable[row.names(metatable) %in% row.names(OTUs),]
meta$Time <-factor( meta$Time , levels=c("13-Jun", "16-Jun", "19-Jun"))# filter sample data to include ONLY the samples included in this analysis. Otherwise, adonis will give an error.

The adonis function from the vegan package runs a PERMANOVA or permutational analysis of variance:

adonis(vegdist(OTUs, method = "bray") ~ Time, data = meta)
## 
## Call:
## adonis(formula = vegdist(OTUs, method = "bray") ~ Time, data = meta) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)
## Time       2   0.20425 0.10212 0.71096 0.13644  0.574
## Residuals  9   1.29277 0.14364         0.86356       
## Total     11   1.49702                 1.00000

There are not significantly different community compositions between dates.

Can also do pairwise comparisons (but shouldn’t really if the PERMANOVA is not significant..):

install.packages("remotes")
remotes::install_github("vmikk/metagMisc")
library("metagMisc")
tst<-adonis_pairwise(x=meta, dd=vegdist(OTUs, method = "bray"), group.var="Time")
tst$Adonis.tab
##      Comparison         R2         F  df     p  p.adj
## 1 13-Jun.16-Jun 0.11832764 0.8052490 1;6 0.430 0.6495
## 2 13-Jun.19-Jun 0.03762336 0.2345653 1;6 1.000 1.0000
## 3 16-Jun.19-Jun 0.13278006 0.9186601 1;6 0.433 0.6495

8.2 Unifrac

Unifrac incorporates phylogenetic distance to determine the dissimilarity between samples. The metric can be a purely phylogenetic distance (unweighted unifrac) or can include abundance data (nonweighted unifrac).

You can get phylogenetic distances between ASVs using Qiime2:

    qiime alignment mafft \
    --i-sequences rep-seqs.qza \
    --o-alignment aligned-rep-seqs.qza

    qiime alignment mask \
    --i-alignment aligned-rep-seqs.qza \
    --o-masked-alignment masked-aligned-rep-seqs.qza

    qiime phylogeny fasttree \
    --i-alignment masked-aligned-rep-seqs.qza \
    --o-tree 16unrooted-tree.qza

    qiime phylogeny midpoint-root \
    --i-tree unrooted-tree.qza \
    --o-rooted-tree rooted-tree

And Qiime2R will import your rooted tree:

tree<-read_qza("rooted-tree.qza")

then add the phylogenetic tree to the phyloseq object that is relative abundance transformed (%):

ps_uni<- merge_phyloseq(physeqPra, tree$data)

Weighted Unifrac distance PCoA:

ordu = ordinate(ps_uni, "PCoA", "unifrac", weighted=TRUE)
p = plot_ordination(ps_uni, ordu, color="Time")+theme_bw() +scale_color_manual(values=colors)+ geom_point(size=3)+
  theme(text=element_text(size=14))
p

Just like the relative abundance plots, you can use Qiime2 to make interactive ordination plots for data exploration:

qiime diversity core-metrics-phylogenetic \
    --i-phylogeny rooted-tree.qza \
    --m-metadata-file sampleMap.txt \
    --i-table table.qza \
    --p-sampling-depth 73300 \
    --output-dir core-metrics-phylogenetic    

A very important point here is the flag --p-sampling-depth 73300. Running the qiime diversity command in qiime requires that you choose a number of reads to normalize or “rarify” to (in this case 73,300). This means that samples that have more than that number of reads will be randomly subsampled to have exactly that number of reads. Samples that have less reads will be discarded completely. To choose the sampling depth, you often have to make tough decisions about which replictates you can sacrifice versus how much sequencing depth you want to keep. No matter what, you end up throwing out data, which is not preferable. For a deeper discussion check out: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003531 by McMurdie & Holmes.

The output of this command will give you:

bray_curtis_distance_matrix.qza
bray_curtis_emperor.qzv
bray_curtis_pcoa_results.qza
evenness_vector.qza
faith_pd_vector.qza
jaccard_distance_matrix.qza
jaccard_emperor.qzv
jaccard_pcoa_results.qza
observed_otus_vector.qza
rarefied_table.qza
shannon_vector.qza
unweighted_unifrac_distance_matrix.qza
unweighted_unifrac_emperor.qzv
unweighted_unifrac_pcoa_results.qza
weighted_unifrac_distance_matrix.qza
weighted_unifrac_emperor.qzv
weighted_unifrac_pcoa_results.qza

as always, the .qzv files are interactive visualizations you can pull over to view.qiime2.org:

Likewise, you can do your PERMANOVA significance testing with Qiime2 (pairwise comparisons are included):

qiime diversity beta-group-significance \
  --i-distance-matrix unweighted_unifrac_distance_matrix.qza \
  --m-metadata-file sample-metadata.tsv \
  --m-metadata-column Time \
  --o-visualization unweighted-unifrac-body-site-significance.qzv \
  --p-pairwise

(this example screenshot is from the Qiime2 Tutorial)

If you want to normalize your count table instead of rerefying it, you can normalize the counts in R and then finish your analysis in R (suggested), or export back to Qiime2.

8.3 Aitchison’s Distance

Amplicon data (really all sequencing data) are compositional. Compositional bias makes it challenging to make conclusions about treatment effects, you can read all about that here: https://www.frontiersin.org/articles/10.3389/fmicb.2017.02224/full

There are some normalization and ordination methods that specifically aim to minimize compositonal bias. Using a centered-log-ratio (clr) normalization with the Euclidean distance metric is referred to as an Aitchison distance. The package CoDaSeq applies a clr normalization on amplicon count tables:

Install CoDaSeq:

devtools::install_github('ggloor/CoDaSeq/CoDaSeq')

load library:

library(CoDaSeq)
OTU4clr<- data.frame(t(data.frame(otu_table(ps))))
row.names(OTU4clr) <- gsub("\\.", "-", row.names(OTU4clr))
OTUs.clr <- codaSeq.clr(OTU4clr + 0.5, samples.by.row=TRUE)
OTU2 <- otu_table(as.matrix(OTUs.clr), taxa_are_rows = FALSE)

psCLR <- phyloseq(OTU2,TAX,META)
ordu = ordinate(psCLR, "PCoA", "euclidean")
p<-plot_ordination(psCLR, ordu, color="Time")+theme_bw() +scale_color_manual(values=colors)+ geom_point(size=4) +  theme(text = element_text(size=14))
p

And again, you can run a PERMANOVA, this time on the clr transformed data and with vegdist set to “euclidean”:

set.seed(1)
adonis(vegdist(OTUs.clr, method = "euclidean") ~ Time, data = meta)
## 
## Call:
## adonis(formula = vegdist(OTUs.clr, method = "euclidean") ~ Time,      data = meta) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)
## Time       2     19644  9822.2 0.90493 0.16743  0.647
## Residuals  9     97687 10854.1         0.83257       
## Total     11    117331                 1.00000

For this data set, the results changed very slightly when using taxonomic (unifrac) or non-taxonomic (Bray-Curtis) distances/dissimilarities. Additionally, using a normalization method that minimizes compositional bias did not substantially alter the results. The different methods can produce more differentiated results depending on the dataset.

9 Differential Abundance testing with DESeq2

Install DESeq2:

BiocManager::install("DESeq2")

Load library:

library(DESeq2)